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Abstract: A parallelized finite difference code based on Newton’s method for systems of non- 
linear elliptic boundary value problems in two dimensions is analyzed in terms of computational 
complexity and parallel efficiency. An approximate cost function depending on 15 dimensionless 
parameters (including discrete problem dimensions, convergence parameters, and machine charac- 
teristics) is derived for algorithms based on stripwise and boxwise decompositions of the domain 
and a one-to-one assignment of the strip or box subdomains to processors. The sensitivity of the 
cost function to the parameters is explored in regions of parameter space corresponding to model 
small-order systems with inexpensive function evaluations and also a coupled system of nineteen 
equations with very expensive function evaluations (a reacting flow model of engineering interest 
which motivates the work). The algorithm has been implemented on the Intel Hypercube, and 
some experimented results for the model problems with stripwise decompositions are presented and 
compared with the theory. In the context of computational combustion problems, multiprocessors 
of either message-passing or shared-memory type may be employed with stripwise decompositions 
to realize speedups of O(n), where n is mesh resolution in one direction, for reasonable n. To realize 
speedups of 0(n 2 ), the total number of mesh points, only hypercubes appear attractive. These 
results must be qualified by hardware assumptions, including sufficient local memory per processor 
to hold all of the data defined on the associated subdomain, and selection of machine parameters 
typical of presently commercially available components. 
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1. Introduction 

Systems of nonlinear elliptic boundary value problems (BVPs) constitute an important class 
of problems in large-scale scientific computation. In many instances, including the example con- 
sidered herein of the detailed modeling of steady multidimensional reacting flows, evaluation of 
the governing equation residuals at a given solution iterate constitutes a significant expense, so 
methods which make efficient use of the function evaluations are required. For such problems, 
robust variations of Newton’s method are often preferable to less fully coupled iterative methods or 
associated time-marching methods (see, e.g., [12]), if memory capacity allows, as we tacitly assume 
it does throughout this paper. 

In contrast, for many other systems of nonlinear elliptic BVPs, such as the steady incompress- 
ible Navier-Stokes problem, the cost of a residual function evaluation is much less than the cost 
of forming and solving the large linear systems arising at each stage of Newton’s method, so that 
various time or time-like relaxation methods requiring less implicitness have traditionally been pref- 
ered. Explicit methods generally parallelize very efficiently, and might thus appear to be superior 
to more implicit methods as parallel algorithms, even where this might not be true in the serial 
context. We demonstrate through complexity estimates that there is a wide range of parameter 
space over which Newton-based methods also parallelize within acceptable ranges of computational 
efficiency, and that this region includes reacting flow problems of current interest. 

The computational complexity model is introduced in general terms, but is ultimately made 
problem-specific for the purpose of reducing the 15 parameters of the model to a manageable number 
of interesting ones. Some of these parameters are unavailable or impractical to obtain from theory 
alone, and are supplied empirically from actual serial runtimes. Parallel numerical experiments for 
realistic reacting flows are as yet impractical on the hardware conveniently available to the authors 
for this study, but results on model problems in representative ranges for the parameters n, the 
resolution of the grid, and p, the number of processors employed, support various testable results 
of the complexity analysis. 

The organization of this paper is as follows. In section 2 we sketch in general terms a mod- 
ified Newton algorithm for the solution of nonlinear elliptic boundary value problems by finite 
discretization methods. Section 3 describes a serial implementation of this algorithm which has re- 
cently been applied successfully to the computation of an axisymmetric, over- ventilated, subsonic, 
laminar methane-air jet diffusion flame. Parallel implementation issues and a complexity theory 
are presented in section 4. Section 5 contains some actual performance data for model systems ob- 
tained on the Intel Hypercube, and a discussion of its implications for modeling realistic systems. 
We conclude in section 6 with some recommendations for the further development of the parallel 
computation of reacting flows. 

2. An Algorithm for Nonlinear Elliptic Boundary Value Problems 

2.1. Newton’s Method for a Generic Nonlinear System 

We consider in this section a discretized elliptic system in the generic form 

F{<(>,x,ir) = 0 , (2.1) 

where <f> is a vector of unknowns, x an d tt are vectors of parameters, and F is a vector-valued 
function in which the components of <f>, x, and ir may enter nonlinear ly. For the sake of definiteness 
in the following discussion, we assume that F arises from a low-order finite difference discretization 
on a tensor-product grid in two dimensions, with m gridpoints in one direction and n in the other, 
and that there are r unknowns defined at each gridpoint. The unknown vector is then comprised 
of ran elements, including boundary values, conveniently enumerated by triple subscripts as 4>kij 
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for k = 1, and F is likewise comprised of rmn scalar equations, 

including discrete boundary conditions. The vector of parameters x consists of s gridfunctions 
which are not necessarily independent of the unknowns <j > , but are in general given by a set of 
explicit algebraic relationships which involve only unknowns defined at a single point of the form 

Xlij = Xl (filij 5 firij) 


for / = 1, . . = 1,. . . ,m;j = 1, . . . ,n. 

The reason for distinguishing between the <f> and x gridfunctions is that the equations for the 
latter contain no spatial derivatives and what would otherwise be a system of order ( r+s)mn can be 
condensed algebraically with modest effort to a system of order rmn. The parameters it are global 
parameters which are independent of the unknowns <j>. As an illustration of this partitioning in the 
modeling of reacting flows, the local mass fraction of a species belongs in <f>, its local coefficient of 
diffusion in the gas mixture belongs in x» and the activation energy of its formation in a particular 
elementary reaction belongs in w. The parameters x and ir will be suppressed in what follows, but 
their presence is felt through the terms involving s in the complexity model in section 4. 

The system (2.1) may be solved efficiently by a damped modified Newton method provided 
that an initial iterate sufficiently close to the solution <f>* is supplied. The damped modified 
Newton iteration is given by 

^(*+0 = 4 &(*) + (2.2) 

where 

6 <t>W (jW)- 1 F(^ fc >), (2.3) 

where the matrix JW is an approximation to the actual Jacobian matrix evaluated at the k th 
iterate. We refer to 6 <f>W as the k th update. When A^*) = 1 and JW = = ${ 4 > {k) ), for all k, 

a pure Newton method is obtained. 

For the pure Newton method, the iteration (2.2) possesses at least a quadratic rate of con- 
vergence when the hypotheses of the Kantorovich theorem are satisfied. That is, ||<^( fc+1 ) — <f>*\\ < 
— ^*H 2 f° r some constant c. For the modified Newton method without restarting (t.e., with 
j( fc ) = | for all n > 0), the rate of convergence is not guaranteed to be better than linear. 
Consequently, there is a tradeoff to be made between the higher cost per iteration of the pure 
Newton method and the higher iteration count of a modified Newton method. This tradeoff has 
been variously resolved in the literature. We follow the practice of restarting our modified Newton 
method whenever the norm of the current update divided by the norm of the first update which 
was obtained with the Jacobian in current use fails to satisfy a bound derived from the Kantorovich 
theorem. This bound depends only on the index of the current update relative to that of the last 
restart (for details, see [13]). 

When the size of the successive modified Newton steps is decreasing in accord with the Kan- 
torovich theorem, no damping is necessary (\W = 1). Since there is no practical means for ensuring 
that the initial iterate lies within the domain of convergence defined by the theorem, damping is 
often necessary in the early stages to prevent divergence. In our implementation, we damp 6 <j>W 
for either of two reasons. First, if bounds are specified by the user for certain components of <f>, 

is chosen as the minimum of 1 and of the value which would drive any component of ^( fc+1 ) to 
its bound. Second, if the projected next update based on damping with the provisional A^, 

6 ^ k+1) = (j( fc ))- 1 F(^ (fc) + A (k) 6 <t> (k) ), 

is larger than the current update 8 ^ k \ then A^*) is recursively halved until this condition no 
longer holds, or until A^ becomes smaller than some tolerance. In the latter case, the algorithm 
unsuccessfully damps to death, and a restart with a better initial condition is recommended. 
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2.2. A Pseudo-transient Continuation Procedure 

A better initial condition for Newton’s method is usually obtainable by an elementary contin- 
uation procedure, namely driving a pseudo-transient form of (2.1), 

D% + m = 0, (2.4) 


where D is a scaling matrix, part of the way towards its steady state [15]. If (2.4) is implicitly 
time-differenced with the backward Euler method, using time step At from a given initial state 4> 
a new system, 


F{4>) = + F(<f>) = 0, 


(2.5) 


results, which possesses the Jacobian 


~ At + d<)> ' 


With an appropriate D and a sufficiently small At , (2.5) may in principal be iterated under the 
successive substitution <j> *— 4> as a time-accurate solution of a physically transient problem. This 
is generally infeasible in terms of execution time, and possibly unproductive because of spatial 
truncation errors which remain finite even as the time step becomes infinitesimal. (The form of the 
algorithm we describe in this paper leaves out consideration of spatial grid adaptivity. This feature 
is important in combustion problems and has been implemented in the serial version of the code, 
but not yet in the parallel version, where load balancing considerations are necessitated.) 

For a succession of time steps which may be larger than what would be required for accurate 
resolution of the physical transient, the iteration of (2.5) is a numerically robust procedure for 
generating an initial iterate for Newton’s method. Moreover, due to the close relationship of (2.5) 
to (2.1), only minor modifications of the steady-state code for (2.1) are required to make it a 
transient/steady-state hybrid. Adaptive control of the size of the time steps used in (2.5), can 
enhance the efficiency of such a hybrid algorithm. One useful strategy is to choose At based 
on the temporal truncation error of the most rapidly varying component of the solution. As <f> 
approaches <f>* and At becomes large, this effects a smooth transition to the (infinite time step) 
steady-state formulation. Alternatively, the growth of At beyond a certain At max can be used to 
trigger automatically a transition to the steady-state formulation. 


2.3. Algorithmic Implementation of the Hybrid Solver 

The Appendix contains two procedures, named SOLVER and NEWTON, which represent 
an implementation of the ideas of this section. SOLVER operates in either of two modes: a 
direct treatment of the steady-state problem, or the steady-state problem preceded by one or more 
adaptive time steps. It has five arguments: the name of a user-supplied subprocedure F (simply 
passed to the internal procedure NEWTON) which evaluates the transient or steady-state residual, 
depending upon the mode; a real solution vector x for communicating the initial iterate upon 
invocation and the converged solution upon a successful return; an integer nj, me for setting the 
mode (if n time > 0, time-stepping is attempted for up to n time adaptively chosen steps); an initial 
time step Afi, which is ignored if n time = 0; and a flag, flagsoLVER, to indicate the state of the 
procedure upon return. There are two other flags in SOLVER which function as logical control 
parameters: flagNEWTONi which communicates the state of the called procedure NEWTON upon 
its return, and flag jacobian, which determines the frequency at which the Jacobian is updated. 
The finest level of control over the Jacobian update frequency lies inside of NEWTON itself, but 
at a coarse level SOLVER forces a refreshing of the Jacobian at the outset of a calculation in either 
mode and also after an unsuccessful time step. There are also two real control parameters internal to 
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SOLVER: a minimum time step A t min and a maximum time step At max - During normal execution, 
the time step At is adjusted upward or downward according to the progress of the iterations in 
a subprocedure which may be problem-specific and is regarded as a black box for purposes of 
this algorithm. A f m ,„ is a lower bound on the size of the time step to keep the computation 
out of inefficient realms. If the procedure attempts to set a time step smaller than A £ min it is 
terminated with a “damped to death in transient mode” message. Recovery from this termination 
calls for a new initial iterate or a willingness to expend lots of CPU time, as indicated by a new, 
lower, input value of At m ,- n . If the procedure attempts to set a time step greater than A t max , 
direct consideration of the steady-state problem is deemed feasible, and the mode is automatically 
switched. The steady-state mode is also switched in by default after the number of time steps 
specified by nu me is exhausted. A failure during the steady-state computation is indicated with 
the flag “damped to death in steady-state mode” . 

The procedure named NEWTON is driven by procedure SOLVER. The mode distinctions of 
SOLVER are invisible to NEWTON; only the meaning of the argument F changes with mode, not 
the way F is handled within NEWTON . The procedure has seven arguments: the name of a user- 
supplied procedure F which it calls to evaluate a transient or steady-state residual; a real solution 
vector x for communicating the initial iterate for the steady or transient F upon invocation and the 
converged solution upon return; an integer n time , a real solution vector from the previous time step 
and a current time step A £*, all three of which are simply passed through NEWTON to F; 
a flag flagffEWTON to indicate the state of the procedure upon return, and a flag flagjACO bian 
to control the frequency of updating of the Jacobian. The internal parameter m counts the number 
of iterations since the current Jacobian was last updated and serves as an index into a table of 
convergence bounds from [13], denoted a m below. There are three internal control parameters, as 
follows. The parameter n mo dijicd is the maximum number of iterations between Jacobian updates 
even if the modified iterations are making their theoretically anticipated progress. The parameter 
A m i„ is a lower bound on the size of the adaptively chosen damping parameter, which is designed 
to abort a calculation started outside of the domain of convergence. Finally, Converge is an absolute 
convergence tolerance, chosen with due consideration to the scaling of the components of x and 
to the machine precision. Note that convergence of NEWTON is based on the (theoretically 
motivated) norm of the update, not on the norm of the residual itself. The size of the residual at 
convergence depends on the condition number of the Jacobian matrix, and may be quite large in 
detailed kinetics combustion problems even after convergence of the solution to machine precision. 

The level of algorithmic detail furnished in the Appendix is sufficient to identify the five basic 
tasks which together account for almost all of the execution time required by the code. In order of 
increasing importance for reacting flow calculations, they are: (1) DAXPY vector arithmetic, (2) 
the evaluation of norms, (3) the solution of linear equations involving the Jacobian matrix, (4) the 
evaluation of governing equation residuals, and (5) the evaluation of Jacobians. 

3. The Physical Model and a Serial Implementation 
3.1. The Continuous Governing System 

The form of the governing equations determines to a large extent the choice of method for 
the linear algebra and the means for evaluating the Jacobian. For definiteness, we cite without 
derivation the following set of elliptic boundary value problems (see, e.^.,[3]). Though typical of 
many reacting flow models, they do not include effects which will in some contexts be essential 
( e.g ., radiation, turbulence), and which could drastically alter the algebraic form of the governing 
system. The problem motivating the present work is a two-dimensional axisymmetric isobaric 
laminar diffusion flame. 

Let r and z denote the radial and axial directions, v r and v z the respective velocity components, 
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and p the density. We introduce the compressible Stokes streamfunction if) such that 



and 


prv z = 


dip 


and the vorticity 

dv r dv x 

The species concentrations by mass are Y*, for k = 1,2 , where K is the total number 
of species in the reaction mechanism. The temperature is T. These principal fields satisfy the 
following equations. 

Streamfunction: 


Vorticity: 

[k (?£) - b (tH)] - b (fb (H) - h H (H) 

+ r *9^; + r2y ( Vf 2 ~ "* ) ■ bo p = °* 

Species Conservation: 

l ( v ‘ d 4) ~ b ( y ‘l) + b <" y * v *> + b (r " y ‘ v *- ) 


( 3 . 1 ) 


( 3 . 2 ) 


— rW^Wk = 0, k = 1,2, 


( 3 . 3 ) 


Internal Energy: 



Other parameters appearing in this system are: the mixture viscosity p, the acceleration of 
gravity g, the diffusion velocities of the k th species in the mixture (V*,, V*,), the specific heat of the 
k lh species c p t, the specific heat of the mixture c p , the thermal conductivity of the mixture A, the 
molecular mass of the k th species Wk, and the generation/consumption rate of the k th species tt >*. 

The system is closed with the multicomponent ideal gas law, 


P = 


pW 

i?T’ 


where W is the mixture molecular mass and p is the pressure. The Arrhenius reaction rate law for 
a system of J reactions gives 

w k = £(vfk ~ ”fk) If) ~ r i \ » 

/- 1 


( 3 . 5 ) 
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where the forward and reverse rates are given by 





(3.6) 


where 

kj = A f j T /3 'exp{-Ej/RT), k r j = kj/K j, (3.7) 

and where, in turn Aj, the pre-exponential rate constants, (3j , the temperature exponents, Ej, 
the activation energies, and Kj, the equilibrium constants, are assumed known. The Vjk are tne 
stoichiometric coefficients of the k th species in the j th reaction. Superscripts P and R on the Vjk 
denote the product and reactant sides of the stoichiometric equation, respectively. Thermodynamic 
and constitutive equations provide the temperature- and composition-dependent specific heats, 
viscosities, thermal conductivities, and diffusion velocities. 

For the calculation of the and the thermodynamic and constitutive properties, we use the 
convenient FORTRAN code packages documented in (6] and [7], respectively. Forming the effective 
mixture transport coefficients (ft. A, and the V*) by appropriate averaging of more fundamental 
quantities defined by the kinetic theory of gases or empirical tables is by far the most expensive 
part of evaluating the governing equation residuals. However, in diffusion flames, accurate diffusive 
transport models are essential. One dimensional studies [8] have shown, for instance, that the 
location of a diffusion-controlled reaction front is shifted significant distances relative to the width 
of the fuel-oxidizer mixing layer as the exponent of temperature variation in a very simple transport 
model is varied within a small neighborhood. 

3.2. Discretization of the Governing System 

The governing equations, along with appropriate boundary conditions, are differenced on a 
two-dimensional tensor product grid which (in the serial code) is generated adaptively from an 
initial coarse grid by subequidistribution of gradients and curvatures of the solution components. 
This concentrates grid points in the regions of high-activity (fronts and peaks) in the domain. 
Second-order differences are used throughout except for gradient boundary conditions and for 
the convective terms of the vorticity, energy and species equations, in which first-order upwind 
differences are employed. This discretization can be accommodated within the standard nine-point 
stencil, and, apart from the source term, insures the diagonal dominance of the Jacobian. 

Ordering the solution components at each gridpoint within a lexicographical ordering of the 
gridpoints (radial index varying more rapidly than axial index) results in a Jacobian which has the 
block nine-diagonal structure indicated in Figure 1. To quantify the overall sparsity, let there be 
m and n gridpoints in the radial and axial coordinate directions, respectively, and r unknowns per 
gridpoint. The r x r blocks must be assumed fully dense to accommodate the most general kinetic 
mechanism and composition-dependence of the transport properties. The fraction of Jacobian 
elements which are nonzero is then approximately 9/(nm). For m and n on the order of 30, which is 
certainly below the minimum required for well-resolved reaction zones, this number is approximately 
1%. It is therefore natural to use a block relaxation method, in which only the diagonal blocks are 
factored by a direct method. The relaxation may be carried out either by throwing the six blocks in 
the wings of the Jacobian to the right-hand side and treating the three contiguous diagonal blocks 
about the main diagonal implicitly (the block-line method), or of throwing all eight off-diagonal 
blocks to the right-hand side (the block-point method). As the outermost loop in the relaxation 
process used to solve the linear problems at each Newton step, the domain is swept repeatedly 
in the axial direction from upstream to downstream. This takes advantage of the approximately 
parabolic nature of all of the governing equations other than that of the streamfunction. Relaxation 
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is, of course, a suboptimal algorithm for solving even sparse linear systems. In order to progress to 
better linear solvers as we progress to denser meshes, we are presently experimenting with a block- 
preconditioned generalized minimum residual (GMRES) method [10] in our serial code. However, 
as will be noted below, the dominant cost of running the code does not necessarily come from the 
linear algebra, and the block relaxation schemes are attractive for their simplicity and their ease of 
parallelization. 

The complexity of the governing equations precludes analytic expression of the elements of the 
Jacobian; therefore a finite-difference approximation to the Jacobian must be used. Extending the 
ideas of Curtis, Powell and Reid [1], as described in more detail elsewhere [14], we can form all of 
the nonzero elements from (9 r + 1) independent vector residual evaluations. Because of the large 
number of function evaluations required to form the Jacobian, its formation may consume nearly 
all of the the running time of the code. To alleviate this burden somewhat, we adopt the practice of 
not re-evaluating the detailed transport coefficients during Jacobian formation but lagging them at 
their values from the previous Newton step residual evaluation. The severity of this approximation 
on the nonlinear convergence rate of the algorithm in comparison to the other major Jacobian 
approximations (the lagging of the Jacobian itself over several Newton steps, and its approximate 
evaluation from finite differences) has not been studied, mainly because the control experiment 
(fresh evaluation of the transport terms with each Jacobian) is prohibitively expensive. 

4- Parallelization Analysis 

4.1. Parallel Implementation Issues 

The large amount of arithmetic to be performed at each gridpoint requiring data defined only 
at that gridpoint suggests the potential for easy parallelization of the detailed kinetics reacting 
flow computation. However, there are potentially three penalties to be paid in distributing the 
overall relaxation-based elliptic solution algorithm over an array of independent processors: syn- 
chronization overhead, communication overhead, and degradation of convergence. These penalties 
are measured indirectly through the speedup and efficiency figures-of-merit of a parallel implemen- 
tation. The speedup is the ratio of the uniprocessor execution time of a given algorithm to that 
of the multiprocessor execution of the same algorithm. The efficiency is the speedup divided by 
the number of processors. For many algorithms, these definitions are unnatural in the sense that 
one would never use the same algorithm in both uniprocessor and multiprocessor environments. 
(Usually better uniprocessor algorithms exist, so the parallel efficiency as strictly defined is inflated 
relative to its advantage.) We adopt measures in which the execution times are obtained from the 
most natural algorithm for each environment. 

The synchronization penalty arises if the processors have data dependencies which cause them 
to cycle between computation and idle time. Even if the processors are programmed homogeneously 
this can happen at convergence checkpoints, for instance, or at other points where an exchange 
of data is required, if they have unequal amounts of work to do. The amount of work to be 
done in a processor is a function of the number of gridpoints in the subdomain assigned to it, 
the type of discrete equations to be enforced at those gridpoints, and possibly variations in the 
character of the solution from one gridpoint to another at which the same equations are enforced. 
(As instances of the latter, consider the nondeterministic arithmetic complexity of obtaining some 
physical parameter required in the governing equations by interpolation from a table or by solution 
of a small nonlinear system of equations at each grid point.) Assuming that the grid points 
are allocated to the processors as evenly as possible (within shape and contiguity constraints) at 
the beginning of each macrocycle between grid adaptations, synchronization delays are relatively 
unimportant in the particular problems we need to consider. The relative number of boundary 
gridpoints (which require somewhat less computational work than interior ones) decreases as the 
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mesh is refined, and though their distribution between the processors becomes more uneven, only a 
small number of processors are thus kept waiting. Similarly, the number of arithmetic operations at 
the interior gridpoints does not vary significantly despite the widely varying solution in our detailed 
kinetics formulation. 

The communication penalty is the time spent exchanging messages between neighboring pro- 
cessors which depend on common data. The significance of this penalty can be great in the present 
context and depends on the absolute amount of data to be communicated, on the granularity of the 
messages into which it is embedded, on its routing between processors, on the amount of arithmetic 
which the algorithm must perform between exchanges, and on the communication-to-computation 
speed ratios of the hardware in question. Two essentially different types of exchanges are required 
when the serial code described in the previous section is parallelized by domain decomposition. 
For the nine-point second-order finite-difference stencil, evaluation of the governing equation resid- 
uals on domains decomposed into strips requires two (vector) exchanges of data per processor per 
iteration, and boxwise decompositions require eight such exchanges. Apart from these pairwise 
exchanges between processors working on adjacent subdomains, global exchanges are required for 
the inner products. The price of distributed memory is therefore heavy local traffic and light global 
traffic. 

In an effort to keep the synchronization and communication overhead down, natural tradeoffs 
may exist which degrade the convergence rate of the algorithm by substituting readily available 
data for the best possible data. In the context of relaxation methods, decreasing the degree of 
implicitness from a Gauss-Seidel method to a Jacobi method is a natural way of decoupling an 
inherently sequential algorithm into independent subtasks. This decoupling may be introduced 
gradually as a function of the granularity of the decomposition, e.g., with two processors one may 
employ block Jacobi between the subdomains but Gauss-Seidel within the subdomains. It can 
be shown theoretically [4] for the model problem of the scalar Poisson equation on the uniformly 
gridded unit square that the asymptotic convergence rates of the the finest granularity point or 
line Jacobi methods are half those of the respective Gauss-Seidel methods, so that twice as many 
iterations are required for an equal error reduction. Therefore, an approximate upper bound 
on the practical parallel efficiency for a processor-saturated Jacobi method applied to the model 
problem is 50%. However, the linear systems which obtain in the reacting flow context are not 
necessarily diagonally dominant and must, from experience, be under-relaxed. The convergence 
penalty incurred when under-relaxed iterations are decoupled is less than that borne by fully- 
relaxed iterations, as the test data of section 5 confirms. Therefore, efficiencies of greater than 50% 
at the processor-saturated limit are certainly not ruled out by this tradeoff. 

In view of the above considerations as well as programming convenience, our parallel imple- 
mentation consists of decomposing the logical tensor product computational domain into logically 
congruent strips or boxes of contiguous unknowns, mapping these subdomains in as neighbor- 
preserving a manner as possible onto network of processors, and programming the processors ho- 
mogeneously. This equidistributes load, minimizes the size and distance of the messages, and 
simplifies the communication patterns. For stripwise decompositions, the natural candidate map- 
pings are onto a ring of processors, a binary-reflected Gray code ring embedded in a hypercube, and 
a bus-connected shared memory machine. For boxwise decompositions, the natural mappings are 
onto a toroidal mesh of processors, a binary-reflected Gray code mesh embedded in a hypercube, 
and a bus-connected shared memory machine. The distributed memory mappings are chosen so 
that subdomains sharing common sides are assigned to processors which are directly connected. 
The only algorithmically significant difference between the distributed memory architectures for a 
given decomposition is then in their overall diameters, that is the maximum distance in message 
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links between any two processors, since all processors must cooperate in inner products and con- 
vergence decisions. For the ring, toroidal mesh, and hypercube, these diameters are p/2, y/p, and 
log 2 p, respectively. See, e.g., [11, 5] for fuller descriptions of these interconnection topologies. 

4.2. Parallel Complexity Analysis 

A complexity analysis of a fairly involved computational procedure is most clearly approached 
in a modular fashion. Accordingly, we begin by constructing operation counts for the five major 
computational subtasks listed at the end of section 2, then combine them in proportion. A total 
of 15 parameters are needed to describe a model with sufficient versatility to cover the region of 
parameter space in which we would like to use the model now and in the near future. Better 
complexity models involving yet further parameters are possible, but are not necessary to obtain 
the basic order-of-magnitude results sought. 

We begin with a fairly general model for the computer itself, since specification of the design 
of hardware well-suited to parallel reacting flow computations is one of our goals. A network of p 
homogeneous processor elements is assumed, each of which has sufficient local memory to represent 
the data of the associated subdomain. We define 7 as the unit of scalar floating point operation 
time. To be precise, we count the scalar multiply-add operation as one 7. A scalar division is 
counted with the same weight, and an a scalar exponential is assumed to require 67. Any vector- 
processing capability of these elements is ignored. (The savings of vectorization are not negligible, 
even as regards the transport and source terms f, but taking advantage of this capability would 
require substantial recoding of our present packages, along with a more complicated complexity 
model.) 

The processors communicate by passing messages or accessing shared memory through a global 
bus, depending upon the interconnection. We define a as the time overhead to initiate and route 
a message of any length between a pair of neighboring processors (which we neglect entirely for 
the shared memory machine) and ft as the additional transfer time per floating point word (which 
is simply the reciprocal of the bus bandwidth for the shared memory machine). For simplicity, 
we assume that only sending messages (or depositing data in the shared memory) takes time. 
Receiving (or reading data) is assumed to be free. Since all sent messages are presumed received 
at some point, any additional reception cost which ought to be included to model a particular 
machine can be incorporated by simple readjustment of a and ft (e.g., doubling them if the two 
costs are comparable). We define three architecture-dependent functions of p: Ce x ,» and Cex,c, 
the local exchange coefficients for processors sharing neighboring sides and corners, respectively, 
and Crc, the global reduction coefficient These coefficients multiplying a and ft take into account 
the degraded ability of the network to exchange data as the number of processors grows. In the 
case of the ring, toroidal mesh, and hypercube, Ce x ,» is one regardless of the number of processors 
because there are channels solely devoted to these message routes in the network. In the case of the 
shared-memory machine, the bandwidth to memory per processor must be divided by the number 
of processors; hence Cex,» is P itself. In the case of the toroidal mesh and hypercube, the two 
distributed memory configurations for which it makes sense to introduce boxwise decompositions 
(and thus neighbors at the corners), Ce x ,c is 2. This is because there are no devoted channels for 
these message routes and each corner-to-corner message must be routed through a mutual neighbor. 
(This is a conservative assumption; we ignore the fact that in some systems, the overhead of sending 
an Ar-hop message is less than k times the overhead of sending a direct, or 1-hop message.) On a 
bus-connected machine, messages have no preferential channels, hence Ce x ,c is the same as Cex ,» , 
namely p. The global reduction operation produces a single scalar from scalars defined on each 
processor, whether by addition in the case of a global inner product whose partial sums are locally 
accumulated, or by the logical “and” operation in the case of a convergence test. Our reduction 


fVincent Giovangigli, CNRS, Paris, personal communication (to appear) 
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algorithms (which, though simple, are beyond the scope of this study; see (11, 5] for a fuller 
discussion) are close to optimal for a reasonable “common denominator” of network software. For 
the ring, toroidal mesh, and hypercube, Cr c is twice the diameter of the network; it is 2 p for the 
shared-memory machine. 

The complexity model depends in addition on basic discrete problem dimensions: the resolution 
of the grid n (hereafter assumed the same in each direction for simplicity), the number of points in 
the finite difference stencil q , the number of components (or governing equations) per gridpoint r, 
and the number of parameters per gridpoint $. (Though q is “hard-coded” as nine in the present 
context, the formulae below are also valid for second-order schemes without cross-derivatives, for 
which q — 5. Generalization to higher-order schemes is not automatic, since an expanded discrete 
stencil would require that the processors share not only the boundary data of their associated 
subdomains, but successively more interior data as well. ) 

In analogy to the machine-dependent coefficients, there are coefficients multiplying the basic 
problem dimensions which depend on the details of the physics and chemistry incorporated into 
the governing equations. A wide variety of such physical models can be represented by simply 
distinguishing between three types of terms: a per-stencil-point-per-component operation count 
Ccd (whose main contribution is from the convective-diffusive terms), a per-component operation 
count Cat (whose main contribution is from the Arrhenius source terms), and a per-parameter 
operation count Cj T (into which all of the transport property calculations at each grid point are 
lumped). 

In terms of these parameters, the major algorithmic subtasks have the following complexities. 
DAXPYs (strips or boxes): 



T 


The DAXPY requires no communication and parallelizes perfectly over any partitioning. 
Global Norms (strips or boxes): 


fn 2 

— i 

L p 


7 + [CRe\<x+ [CR t ]p. 


The norm requires local inner products followed by a global reduction of the partial sums with 
addition. In the spirit of order-of-magnitude estimation, here and below, only arithmetic which is 
done at every gridpoint will be considered. (Thus, we do not consider the comparatively negligible 
cost of normalizing the norm, for instance.) However, scalar communication can be nearly as 
expensive as vector communication when a is large. Thus, no communication is negligible a priori. 

Block Relaxation (strips): 


— ( qr 2 + 4 r) 
P 


1 + [2 C E x,» + Crc] a + (2 nrCEx,» + Cfle] /?• 


Block Relaxation (boxes): 



(qr 2 + 4r) 


1 + [ 4 Cex,» + (? - $)Cez,c + CR e ]a + 


4 —=rCEx,s + (q ~ $)rCEx,c + Crc 

y/P 


0. 


The arithmetic cost of each sweep of the block relaxation reflects the formation of a modified right- 
hand side (which includes a term in r 2 from each Jacobian block which is not treated implicitly), 
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back-substitution using pre-factored Jacobian blocks (either block tridiagonal or block diagonal, 
additional terms in r 2 ), a DAXPY update to the solution vector, and a convergence check. In 
the case of the stripwise decomposition, each pair of neighboring processors must exchange their 
bounding rows of data consisting of n gridpoints. Neighbors in boxwise decompositions are of two 
types: those that must exchange bounding rows of data consisting of n/^/p gridpoints, and those 
that must exchange corner points of data only. (The latter drop out when a 5-point stencil is 
employed.) 

Function Evaluation (strips): 


r n 2 


—(CcDqr + CatT + Ctts) 


1 + [2 C Ex ,»] a + [2n(r + s )Cez,*\ 0- 


Function Evaluation (boxes): 


\n 2 


{CcDqr + C Ar r + C Tr s) 


7+ 


[4Cj Zx,» + (? - 5)C Ex , c )ai + 


4-^(r + s)C Ex , t + (q - 5)(r + s)C Ex<t 


0 - 


The function evaluation arithmetic cost is independent of decomposition, but the communication 
requirements are different for the same geometrical reasons discussed in the block relaxation step. 

Jacobian Evaluation (strips): 

["p 0 ?r + 1 ^ C ° Dqr + CArT ) + 2qr 2 + 2r + 5 f3 )] 7+ 

[(2(gr + 1) + (q - 3))C£ x ,,]a + [(2($r + l)n(r + s) + (q - Z)nr 2 )C Ex> , ] 0. 

Jacobian Evaluation (boxes): 


^( qr + 1 ){CcDqr + CUrO + 2 qr 2 + 2 r + j 7+ 

[4(?r + 2 )C Ex<l + (q - 5 )(qr + 2)C Ex<e )ot+ 



((qr + l)(r + s) + r 2 )C Ex ,» + (q — 5)((?r + l)(r + s) + r 2 )C Ex<c 


0 . 


The Jacobian evaluation cost includes (qr + 1) independent function evaluations, in which the 
dependence on the detailed transport through Cjy is neglected, along with some finite difference 
arithmetic performed on the function evaluations to generate the matrix elements, and some pre- 
processing which is amortized over all of the block relaxation sweeps utilizing the Jacobian. The 
stripwise method preprocessing cost of 7r s /3 reflects an LU-block factorization of the diagonal 
block and the solution of r-dimensional linear systems involving the columns of the two neighbor- 
ing off-diagonal blocks. In the boxwise method (the most fine-grained limit of which is block-point 
relaxation) the preprocessing cost is just r 3 / 3 per gridpoint for factorization of the diagonal block. 
The communication terms include both the sharing of subdomain boundary data needed for the 
function evaluation and the shipping of Jacobian blocks between neighboring processors. The latter 
is necessary in order to exploit the Curtis- Powell- Reid (CPR) technique refered to in section 3. In 
the CPR method, it is easier to compute, e.g., gV't* at gridpoint (i,j) than it is to compute 
OF’ ' 

$$-^ 1 , where j represents any of the governing equation residuals at the point and fcj any of 
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dF' 

the unknowns. However, it is which is actually needed locally for the linear algebra. If the 

point (i,j) in question lies at the low-j end of a subdomain then this quantity is formed at the 
high-j end of the neighboring subdomain as - fyfV , and so forth, then the blocks are swapped. This 
approach is justified since the block-swapping requires no arithmetic and it does not contribute to 
the leading communication term in a when r is large. The extra function evaluations which would 
otherwise be required would contribute to the leading terms in each of the brackets multiplying 7, 
a, and /?. 

If the linear systems involving the Jacobian are solved to a level of convergence which is 
independent of the granularity of the decomposition, then the number of hybrid Newton steps is 
independent of the type of decomposition and the number of processors, f We may therefore 
use the hybrid Newton step as the largest aggregate in the complexity model, and work out the 
complexity on a per-call-to-NEWTON basis. Neglecting the overhead associated with possible 
problem-dependent damping and selection of time steps, we can state that each hybrid Newton step 
consists of one DAXPY, two norms (one of the solution vector update for determining convergence, 
and one of the steady-state residual norm for additional monitoring), one function evaluation, 

L relaxation sweeps, and l/K Jacobian evaluations. The last two parameters of the model are 
therefore the average number of relaxation sweeps per Newton step, L, and the average number 
of Newton steps between Jacobian evaluations, K. For systems as complicated as (3.1)-(3.7), no 
theory exists on either L or K. We must therefore obtain L and K from problem-specific serial 
experiments, and be content to extrapolate them into other regions. 

To illustrate the usefulness of the complexity analysis while keeping the size of this paper within 
reason, it is convenient to focus on individual points in physical parameter space and to study the 
parallelizability as a function of n, p, and the dimensionless ratios a = a/7 and 6 = /?/ 7 alone. 
For this purpose, we adopt the specific 16-species, 46-reaction mechanism of [9] for the oxidation 
of methane. We then have for the basic problem dimensions: q = 9, r = 19 (with unknowns 
.,Y\g), and s = 20 (with parameters p,c p ,p,X, Di,D 2 ,. . . , Dig). The operation 
count coefficients are approximately: Ccd & 33, C\ T fa 270, and Cj r fa 2000. The coefficient Cat 
is obtained directly from counting the operations required to evaluate the species and energy source 
terms, taking the sparsity of the sums and products of (3.5)-(3.7) into account, and dividing by 
r. The other two coefficients are too difficult to estimate directly from the code itself, so they are 
fitted to C^r for a particular q , r, and s by the simple formula: 

CUi-r _ CcDir _ ^TrS 

m m rn J 

Tat Tod Tjr 

where T^ r , Tcd, and Tx r are actual serial CPU timings of the respective portions of the code which 
performs the function evaluation. L fa 200 and K fa 5 are also available from actual runs of the 
serial code on a grid of approximately 1200 nonuniformly-spaced points. 

4.3. Parallel Performance Analysis 

With the preceding formulae and assigned values, we can estimate the execution time T of the 
parallelized solver (normalized to the floating point speed of the processors) through a relation of 

the form: 

T 

~ = A(«,P) + afa{n,p) + bf/}(n,p). (4.1) 

Under the additional assumptions that a = 100 and that 6=1 the leading-order communication 
terms of can easily be extracted, leading to the entries under a f a , and bfp in Table 1. Minimizing 

fThis condition may not be satisfied for relaxation methods in which convergence is based solely upon the size of the 
update, since the deteriorating condition number of the iteration matrix with finer granularity may allow a larger residual at 
convergence. However, this has a relatively small effect on the total number of Newton steps. 
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T/') as a function of p leads to the optimal execution time, T op t, for a given n, a, and 6. The number 
of processors which minimizes T/') for each of the six mappings described above is given in the 
fourth column of Table 1 in orders of n. The fifth column contains the actual optimum number 
of processors for the case n — 64 when all the terms of the complexity model are taken into 
account, and the last column shows the fraction of the total execution time which is devoted to 
communication at this optimum. (The value n = 64 is selected because one-dimensional studies of 
methane- air diffusion flames [8] suggest that this should be sufficient resolution in either direction 
for approximate grid-independence of the discrete solution.) 


Mapping 

■2SH 

a f a / 10 4 

6//J/10 4 

Popt.d 

Popt 

comm. frac. 

strips/cube 

B5BSI 

4.0 log 2 p 

1.1 n 

o(n 2 ) 

64 

0.014 

strips/ring 

Kasai 

2.0 p 

1.1 n 

O(n) 

64 

0.028 

strips/bus 

ftUSSI 

- 

1.1 np 

0(n 4 / 2 ) 

64 

0.39 

boxes/cube 


4.0 log 2 p 

2.1 np -1 / 2 

■m™ 

4096 

0.43 

boxes/mesh 


4.0 p 1 / 2 

2.1 np" 1 / 2 


3686 

0.70 

boxes/bus 


- 

2.1 np 1 / 2 

EbEB 

206 

0.52 


Table 1: Optimal processor estimates for six subdomain-to- 
architecture mappings, and estimates of the fraction of the 
total execution time devoted to communication at the opti- 
mal p, assuming n = 64, a = 100, and 6=1. 

In the case of stripwise decomposition on a hypercube, the optimal order of n 2 is too optimistic, 
since it the algorithm can only employ as many processors as there are strips. (By construction, 
1 < p < n for the stripwise decompositions and 1 < p < n 2 for the boxwise decompositions.) The 
communication fraction remains very small (less than 2%) when this decomposition is processor- 
saturated at p = n = 64. From a user perspective, this order of speedup reduces an hour of serial 
execution time to approximately one minute. 

On a ring, the communication fraction for processor-saturated case is only slightly worse (about 
3%). The only difference in complexity between the two cases is a larger global reduction time due 
to the fact that the ring interconnect is less rich. However, global reductions form a small proportion 
of the total running time in this region of parameter space. 

In the case of stripwise decomposition on a bus-connected shared memory machine the bus 
becomes choked with local exchanges as p increases leading to an optimal order of n 1 / 2 . However, 
this is too pessimistic for reasonable values of n and the chosen machine parameters since the 
constant in front is large and the optimum actually lies on the boundary. Although the execution 
time continues to drop as processors are added to the choked bus, the efficiency drops considerably. 
In the processor-saturated case, nearly 40% of the time is devoted to communication. 

In the case of boxwise decomposition on a hypercube, the optimal order estimate of n 2 (which 
is derived from the balance between / 7 and f a alone) is justified by the full model, in that 64 2 = 
4096 processors can usefully be employed. However, at this processor-saturated limit the speedup 
is limited since over 40% of the time is devoted to communication. (For a = 100 and 6=1, 
the communication is dominated by the global reduction operation.) Nevertheless, from a user 
perspective, a processor array of this size could reduce about 40 minutes of serial execution time 
to approximately one second. 

The other architectures do not perform as well for the boxwise decomposition. The difference 
between the 2 log 2 p global reduction coefficient for the hypercube and the corresponding 2 y/p for 
the mesh is significant for p on the order of 2 12 , and the optimal order falls to n 4 / 3 . This is somewhat 
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pessimistic for the problem under consideration because the constant in front is greater than one, 
and the execution time is actually minimized at p = 3686. However, communication costs 70% of 
the execution time at this point. 

The bus-connected shared memory machine becomes hopelessly choked with a mere 206 pro- 
cessors under this decomposition. Further increases in the number of processors actually increase 
the execution time. At this point, over 50% of the execution time is spent swapping data. The 
optimal order of n 2 / 3 is not much better than in the stripwise decomposition case. 

The optimum number of processors has been defined above as the number which minimizes 
the execution time. However, the “optimum” might actually be smaller if the marginal efficiency 
of the processors is small and a cost- weighted objective function is used. Because of this important 
consideration, we conclude this section with the presentation of surface plots of efficiency over a 
range of of n and p for the complexity model defined above. (For plotting clarity, the efficiency is 
set to zero in the inaccessible ranges of p > n for stripwise decompositions or p > n 2 for boxwise 
decompositions.) Results are given for all six mappings for the machine parameters a = 100 and 
6=1, and for the two hypercube mappings (which are the most promising) at a = 1000 and 6=1, 
in order to check the sensitivity of their performance to the message initiation overhead. Figure 2 
contains the strip-based mappings, and Figure 3 the box-based mappings. 

It is apparent from Figure 2(d) that the hypercube/strip mapping can bury the cost of initiating 
messages over a rather slow network, whereas the performance of hypercube/box mapping shown 
in Figure 3(d) has deteriorated to possibly uneconomical levels. Further calculations (not plotted) 
show that the hypercube/strip mapping maintains greater than 70% efficiency even at a = 10,000. 

In all the calculations of this section, L was set to the constant value 200 obtained from serial 
runs with the block-line algorithm. This crude assumption potentially endangers the results, making 
them appear too optimistic at large n and p, since L appears linearly in the leading terms of the 
communication. (It does not appear in the leading terms of the arithmetic, since the Jacobian and 
function evaluations dominate.) By analogy with a scalar Poisson equation, L might be expected 
to increase like n 2 at a fixed p, by as much as a factor of 2 at fixed n over the full range of p for 
the stripwise decomposition, and by as much as a factor of 4 over the full range of p in the boxwise 
decomposition case (see, e.y., [4]). However, experience indicates that these scalar Poisson-based 
iteration estimates are pessimistic for the Jacobians of two-dimensional strongly convective systems 
of reacting flow equations. Still, to be conservative, these optimal p estimates should be regarded 
as valid only in the neighborhood of the parameter space in which they were derived. In particular, 
they are not uniformly valid in n. In order to model performance conservatively over a wide region 
of parameter space with the convenient assumption of a constant L, the worst-case L for the region 
(based on its extreme point in n and p) should be used. This number is not yet experimentally 
available to us for n = 64, the extreme resolution in Figures 2 and 3. 

5. Parallel Experiments for Model Problems 

The solution algorithm described in section 2 has been implemented on the 32-processor Intel 
iPSC-5 (with the expanded local memory configuration) and some results are presented in this 
section for a few simple choices of F(^). These results serve three main purposes: (i) they validate 
the parallel code itself, (it) they provide a test for the complexity theory of section 4 in the case of 
constant L, and (Hi) they illustrate that the dependence of the overall performance of the code on 
variations in L with n or p is weaker than one would expect from the model problem, under various 
conditions which typify our reacting flow calculations. These conditions are: under-relaxation of 
the linear solver, strong (upwinded) convection, and large r with strong coupling of the r unknowns 
at each point. 

The model Poisson problem is 


-V 2 u = / 
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on the unit square with homogeneous Dirichlet boundary conditions, where f(x, y) is chosen so 
that the exact solution is u(x, y) = sin(7ra;) sin(7ry). The initial iterate is u = 0. The stripwise 
decomposition is applied to a uniform 32x32 grid, using 1, 2, 4, 8, 16, and 32 processors. Because the 
problem is linear, convergence is obtained in one Newton step. An absolute convergence tolerance of 
10 -6 on the normalized 2-norm of the update is applied to the linear iteration, in which a relaxation 
factor of unity is employed. 

The results appear in Table 2. The first three rows display convergence data: the number 
of iterations required for a given granularity of the decomposition, this number normalized to the 
uniprocessor case, and the relative reduction of the 2-norm of the residual at “convergence” . One 
observes that the number of iterations required for the processor-saturated p = 32 (block-line 
Jacobi) case is nearly double that of the p = 1 (block-line Gauss-Seidel) case. It is less than 
the theoretical asymptotic factor of two since this factor is based on equal error reduction, and 
such a condition is not guaranteed with an update-based convergence criterion. The next three 
rows give overall performance measures: the execution time in seconds (which includes a single 
Jacobian formation at the outset), the relative speedup obtained from each doubling of the number 
of processors (ideally 2 each time), and the efficiency ( E = T\/{pT p )). Though the speedup is 
respectable (from a little over half-an-hour to three minutes), the efficiency is fairly dismal (32%) 
at the processor-saturated limit due to two factors: the large communication-to-arithmetic ratio 
of the r = 1 problem and the increase of L with p. (The average number of relaxation sweeps 
per Newton step L is, of course, the same as the tabulated / for this linear problem.) The latter 
dependence can be removed by considering the execution time per iteration. Performance data 
based thereon, analogous to the middle three rows, appears in the last three rows of Table 2. 
By this measure, the processor-saturated efficiency is 58%, and its departure from unity is due 
exclusively to communication-related factors. 
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32 

Ip 
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832 

Ip/h 

1.00 
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1.43 

1.85 

11%11/IMI 

9.5(-4) 

1.0(-3) 

1.2(-3) 


1.6(-3) 

2.0(-3) 

T p (sec.) 

1840. 


539.3 

314.1 


182.5 

i 

- 


1.81 

1.71 

1.47 

1.17 

E 

1.00 

■9 


0.73 


0.32 

t p (sec.) 

4.08 

2.04 

HBEliiS 

0.57 

0.33 


tp- i/t p 

- 

2.00 

1.89 

1.89 

1.73 


e 

1.00 

1.00 

0.94 

0.89 

0.77 

0.58 


Table 2: Intel Hypercube data for the model linear problem, 

- V 2 u = / in the unit square, with u> = 1.0, for different num- 
bers of processors, p. I p is the number of iterations required 
for convergence, ||r/ p ||/||ro|| is the ratio of final to initial resid- 
uals, T p is the execution time, E is the overall efficiency of the 
p- processor algorithm, t p is the per iteration CPU time, and 
e is the per iteration efficiency of the p-processor algorithm. 

The last row of the table can be directly compared with a theoretical efficiency surface plot 
for the hypercube/strip mapping of the model problem, which appears in Figure 4(a). This plot 
was generated using the complexity model of the previous section with j = 9 f, r = 1, s = 1, 


f Though the model problem could be differenced with a five-point formula, the nine-point Jacobian is presently hard-coded 
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Ccd = 1*3, Cj\ r = 29 CTr = 0, L = 451, and K = 1, along with <* = 1000 psec, 0 — 8 /xsec, and 
7 = 40 psec 5, or a = 25, 6 = 0.2. An interesting trend which is barely discernable in Figure 2 (b) 
and (c), but readily apparent in Figure 4(a) is the increase in parallel efficiency with n along the 
processor-saturated diagonal (p = n). The leading-order arithmetic terms grow faster with n than 
the leading-order communication terms for the hypercube/strip mapping, so that parallel efficiency 
improves with grid refinement. We emphasize that this is an artifact of a constant L. 

Figure 4(b) contains two sections of the surface in (a), showing the theoretical efficiency at 
two fixed values of n as p varies from 1 to n. Plotted as circles on (b) are the data from the 
last row of Table 2 and analogous results on an n = 16 problem (not separately tabulated). The 
agreement of the crude complexity model with experiments is better than one has a right to expect. 
We emphasize that we would never use a relaxation algorithm to solve this simple problem, even 
in parallel, since many efficient alternative solution techniques exist and have been parallelized 
(FFTs, multigrid, conjugate-gradient-based domain decomposition, etc.). However, it is a useful 
test problem since reacting flow problems often possess regions in which diffusion dominates or 
co-dominates. In addition, the streamfunction equation is simply a variable-coefficient Poisson 
equation. 

Table 3 illustrates the decline in the iteration penalty associated with loss of implicitness when 
the updates are under- relaxed. The problem is the same as the previous one, except that oj = 0.5, 
which is the value we are typically forced to use in solving systems based on the reacting flow 
Jacobian. Though the total number of iterations is higher at all granularities, it grows more slowly, 
reaching a maximum of 1.28 times the uniprocessor total, rather than 1.85 for the fully-relaxed 
calculation in Table 2. The overall parallel efficiency is accordingly higher. The per- iteration 
efficiency data is identical (to within the precision of the table) to that of Table 2, since L is 
normalized out, and therefore is not repeated here. 
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Table 3: Intel Hypercube data for the model linear problem, 

— V 2 u = / in the unit square, with u> = 0.5, for different 
numbers of processors, p. I p is the number of iterations re- 
quired for convergence, ||r/ p |l/||ro|| is the ratio of final to ini- 
tial residuals, T p is the execution time, and E is the overall 
efficiency of the p-processor algorithm. 

Table 4 illustrates the decline in the iteration penalty which is normally associated with elliptic 
character when the system is, in fact, approximately parabolic and the relaxation sweeps are done 
in the proper direction. The problem (from [2]) is 


— V 2 u + cu y = 0 


into the solver. 

$The Ccd and C^ r operation count coefficients are based on detailed timing tests on a single iPSC node. The C^ r term 
includes the cost of evaluating the sine functions in /. 

5 Double precision is performed in software. 
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on the unit square with the inhomogeneous Dirichlet boundary conditions u(0,y) = tt(l,y) = 1, 
u(a:,0) = 1 + sin(Trx), u(x, 1) = 1 + 2sin(7rar), whose solution possesses a downstream boundary 
layer whose resolution requirements we ignore. The stripwise decomposition is applied to a uniform 
32 x 32 grid, using 1, 2, 4, 8, 16, and 32 processors. The initial iterate is u = 0. Since this is simply 
another scalar linear problem, we present only the iteration counts in Table 4. 

Values of the dimensionless Reynolds/Peclet number c in the jet configuration which motivates 
this study are typically in the range 10 2 to 10 3 . Therefore we consider both magnitudes below. 
(The first row of Table 2 gives the c = 0 limit.) In order to indicate the effect of n upon L, we also 
consider two resolutions, n = 16 and n = 32, at each Reynolds/Peclet number. 
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Table 4: Intel Hypercube data for the strongly convective 
linear problem, — V 2 u + cu y = 0 in the unit square, for differ- 
ent numbers of processors, p. I p is the number of iterations 
required for convergence with c and n as indicated. 

There are three major trends in Table 4. The first is that the number of iterations required 
in this range of c is roughly an order of magnitude less than for c = 0, for the same n and p. In 
nonlinear problems especially, this translates into less communication per Jacobian and function 
evaluation, and thus into higher efficiencies. One also observes that doubling the resolution of the 
grid generally less than doubles the number of iterations for a given c and p. In contrast, for the 
model Poisson problem, doubling the resolution theoretically quadruples the number of iterations. 
In the other limit, c — ► oo, of course, the number of iterations is independent of the resolution 
altogether, namely p itself. Thirdly, the number of iterations may more than double for a given c 
and n as p increases over its full range, the c — ► oo limit being a case in point. In this respect, the 
Poisson problem has a milder dependence of L upon p. However, in that problem, L is much larger 
to begin with. 

Of course, the effect of convection on the reacting flow operator is far more complex than can 
be suggested here, since there is one equation in the governing system (3.1) in which convection 
plays no role at all. In addition, convection is not generally a co-dominant term within the reaction 
zone, so that it is insufficient simply to march through this zone a small number of times. 

Finally, to verify the handling of nonlinearities and to illustrate improved overall efficiency 
with large r, we consider the problem 


-V 2 u* + c exp[^a w u/] = /*, k - 1, . . .,r 

i=i 

on the unit square with homogeneous Dirichlet boundary conditions, where a*j = +1 for / < k and 
a*/ = —1 for l > k, and where the /* are the same as in the model problem, so that all of the 
w* reduce to the solution thereto when c = 0. Here we take c = 4(n — l) 2 so that the nonlinear 
term is co-dominant. However, no time-stepping is employed, and no damping is triggered by this 
particular problem for the initial iterate u = 0. 

The stripwise decomposition is applied to a uniform 32 x 32 grid, using 16 and 32 processors, 
for 1, 2, 3, and 4 unknowns per gridpoint. In order to keep execution times under one hour, smaller 
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numbers of processors (and larger numbers of unknowns) were not used; therefore direct efficiency 
calculations are not possible. The results appear in Table 5. Notice that for a given n and p, L levels 
off as r increases and the intrapoint couplings (which are handled directly) become more important 
relative to the interpoint couplings (which are not). The relative speedups are improving with r at 
this processor-saturated end of the scale due to the increasing arithmetic-to-ccmmunication ratio 
It is primarily this effect which grants the high efficiencies of the parallelized reacting flow solver. 



r = 1 

2 

3 

4 

N 

15* 

17* 

21 

24 

J 

5 

6 

6 

6 

F 

20* 

23* 

27 

30 

1 16 

927 

1531 

2042 

2332 

hi 

1160 

1926 

2561 

2924 

Lie 

61.8 

90.1 

97.2 

97.2 

L 32 

72.5 

120.4 

121.9 

121.8 

Pi6 (sec.) 

318.2 

957.4 

2109. 

3576. 

232 (sec.) 

262.7 

704.3 

1470. 

2406. 

T\e/Tz2 

1.21 

1.36 

1.43 

1.49 

<16 (sec.) 

0.34 

0.62 

1.03 

1.53 

<32 (sec.) 

0.23 

0.37 

0.57 

0.82 

<16/<32 

1.55 

1.68 

1.81 

1.87 


Table 5: Intel Hypercube data for the model coupled nonlin- 
ear problem, - V 2 u k +c exp(^[ =1 a*juj) = /*, for A; = 1, . . . , r 
in the unit square, with u = 1.0, for four different numbers of 
components, r = 1, ... ,4, and two different numbers of pro- 
cessors, p = 16, 32. N is the number of Newton iterations, 
J the number of Jacobian evaluations, and F the number of 
function evaluations. I p is the iteration count for the linear 
relaxation sweeps, totaled over all N Newton steps, and L p 
is the average per Newton step. T p is the CPU time in sec- 
onds, and is the per iteration CPU time in seconds. [* The 
asterisked figures are for the 16-processor case. Due to the 
slight difference in size of the terminal residuals of the linear 
iterations within each Newton step between the two different 
processor configurations, the number of Newton steps and 
function evaluations varied slightly. In the 32-processor case 
N and F were each one greater for r = 1, and one less for 
r = 2. The non-asterisked figures for N, J, and F are the 
same for both processor configurations.] 


6. Conclusions 

We conclude with a discussion of an instance of the “inverse problem” of parallel computing, 
namely, “Given problem X and algorithm Y, what kind of parallel computer should it be executed 
on?” For a generic two-dimensional nonlinear elliptic system characterized by very expensive 
function evaluations, a Newton-like method with a relaxation- based solver of either block-line or 
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block-point type forms a practical algorithm. The block-line form of the algorithm parallelizes well 
on a ring or hypercube network of processors, and efficiently employs any number of processors less 
than or equal to n, the resolution in the coordinate direction transverse to the lines, while tolerating 
a relatively high message initiation to floating point speed ratio. Shared memory machines are also 
useful for this problem, particularly in the coarse-grained limit. This implies that several of the 
currently available parallel clusters containing small numbers of powerful nodes will be of great 
interest to reacting flow modelers. 

To achieve acceptable running times, however, modelers ultimately will not be content with 
speedups of O(n). The block-point form of the algorithm also lends itself to parallel implementation, 
and extends the number of processors that can effectively be put to use beyond 0(n ) into the 0(n 2 ) 
range without any necessity of decoupling the strongly interrelated unknowns defined at a single 
gridpoint. The block-point version places more restrictions on the type of interconnect, however. 
With realistic message overheads a hypercube is required for high performance. No hypercubes of 
the power and dimension required for this algorithm exist today; however, their development in 
the near future is both likely and to be welcomed. 

We have considered only two-dimensional detailed kinetics models. Parallelism is not a crit- 
ical issue in zero- or one-dimensional detailed kinetics models, and no projections are offered on 
three-dimensional models. Though the analysis of this paper could easily be extended to three 
dimensions, practical three-dimensional models will almost certainly require better methods for 
the discretization and the linear algebra. (Particle-based methods with their own very different 
parallelization considerations may ultimately replace continuum-based methods in both two and 
three dimensions.) 

To borrow the now famous phrase of C. Moler, detailed kinetics reacting flow modeling is an 
example of an “embarrassingly parallel” computation. The Newton/time-stepping hybrid algorithm 
will run well on fine-grained versions of technologically feasible machines. With reasonable spatial 
resolution, on the order of 2 10 or more processors may usefully be employed, although it is likely 
that better algorithms will be developed before the present one is implemented on hardware of this 
scale. The basic property of reacting flows leading to the ease of parallelization - the large amount 
of zero-space-dimensional arithmetic required in the evaluation of transport and reaction rate terms 
- will only be accentuated following present trends towards heavier fuels and the development of 
progressively more complicated mechanisms. 
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8. Appendix 

The Appendix contains procedural level language descriptions of the procedures SOLVER and 
NEWTON described in section 2. 


Procedure SOLVER( F, x, n lime , A<i, flagsoLVER ) 
X() * — x 

flag JACOBIAN “refresh” 

If {ntime > 0) Then 
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For k = 1 Step 1 Until nu me Do 
Repeat 

x * If! 

Call NEWTON( F, x , njime, xk-u At k , flagNEWTON y flag jacobi an ) 
If ( flag newton = “converged”) Then 
it-*- x 

Break-Repeat 

Else 

Afjfc ■*— Atk/2 

If (At* < At m i„) Then 

flag solver “damped to death in transient mode” 

Return 

EndJf 

x •*— Xfl 

flag jacobian <— “refresh” 

EndJf 

End-Repeat 

compute next time step, At^+i 
If (Atfc+i > A t max ) Then 
Break -For 
EndJf 
End_For 

flag jacobian *- “refresh” 
name <- 0 

EndJf 

Call NEWTON( F, x, nj tme , x k -i, A t k , fl agNEW ton •> flag jacobian ) 

If ( flagNEWTON =“converged”) Then 

x 4— x 

flagsoLVER “converged to steady solution” 

Else 

flag solver “damped to death in steady-state mode” 

EndJf 

Return 

End-Procedure 


Procedure NEWTON ( F, x, x*-i, At k , flagNEWTON > flag jacobian ) 

X 4— X 
m 4— l 

For k — 1 Step 1 Until unewton Do 
If {flag jacobian = “refresh”) Then 
compute new Jacobian, J 
flag jacobi an «- “recycle” 

EndJf 

If (m = 1) Then 

Call F( x, ntimt , x k . u A t k , F ) 

Ax* J~ l F 

compute initial damping parameter, A 
£ 4— ||Ax|| 
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Xgave <— x 
Repeat 

x *— x + A Ax 
& x look * J 1 F i x ) 

£look * 1 1 Axj 00 fc [ 1 

If (€look ^ £ converge ) Then 

flag newton *— “ converged " 

X *- X 

Return 

ElseJf (£ look < 02 £ And A = 1) Then 
m *— 2 

€«ave * $ 

Break-Repeat 
ElseJf (^jooi < f) Then 

flag Jacobian <— “refresh” 
Break-Repeat 

Else 

A •*— A/2 

If (A < A min ) Then 

flagNEWTON * — “damped to death” 

Return 

End-If 

X * Xgave 

EndJf 

End-Repeat 

Else 

a: +- a: + A ar {oo * 

Call F( x, n t i me , xt—i, A £*, F ) 

A x/ 0 o* + J~ l F 

book <— ||Ax/ 0 ot|| 

If ( ^look < £ converge ) Then 

flagNEWTON «- “converged” 
x *— x 

Return 

ElseJf (&<,<>* And m < n mo dified) Then 

m +— m + 1 

Else 

m +— 1 

flag jacobian <— “refresh” 

EndJf 

EndJf 

EndJ'or 

flagNEWTON *- “failed to converge” 

Return 

End-Procedure 


22 


References 

[1] A. R. Curtis, M. J. Powell Sc J. K. Reid, On the Estimation of Sparse Jacobian Matrices, J. 

Inst. Math. Appl., 13 (1974), pp. 117-119. 

[2] E. C. Gartland, Approximation of a Nonsymmetric, Singularly Perturbed Problem, G. Birkhoff 

Sc A. Schoenstadt ed., Elliptic Problem Solvers II, Academic Press, New York, 1984, 
pp. 545 555. 

[3] A. D. Gosman, W. M. Pun, A. K. Runchal, D. B. Spalding Sc M. Wolfshtein, Heat and Mass 

Transfer in Recirculating Flows, Academic Press, New York, 1969. 

[4] L. A. Hageman Sc D. M. Young, Applied Iterative Methods, Academic Press, New York, 1981. 

[5] S. L. Johnsson, Communication Efficient Basic Linear Algebra Computations on Hypercube 

Architectures, Technical Report 361, Dept, of Computer Science, Yale University, 
September 1985. 

[6] R. J. Kee, J. A. Miller Sc T. H. Jefferson, CHEMKIN: A General-Purpose, Transportable, 

Fortran Chemical Kinetics Code Package, Technical Report SAND80-8003, Sandia 
National Laboratories, March 1980. 

[7] R. J. Kee, J. Warnatz Sc J. A. Miller, A FORTRAN Computer Code Package for the Evaluation 

of Gas-Phase Viscosities, Conductivities, and Diffusion Coefficients, Technical Report 
SAND83-8209, Sandia National Laboratories, March 1983. 

[8] D. E. Keyes Sc M. D. Smooke, Flame Sheet Starting Estimates for Counterflow Diffusion 

Flame Problems, Technical Report ME- 102-86, Dept, of Mechanical Engineering, Yale 
University, March 1986. (to appear in J. Comp. Phys.). 

[9] J. A. Miller, R. J. Kee, M. D. Smooke Sc J. F. Grcar, The Computation of the Structure 

and Extinction of a Methane-Air Stagnation Point Diffusion Flame, Technical Report 
WSS/CI 84-20, The Combustion Institute, 1984. (presented at the 1984 Spring 
Meeting of the Western States Section of the Combustion Institute, University of 
Colorado, Boulder). 

[10] Y. Saad and M. H. Schultz, GMRES: A Generalized Minimum Residual Algorithm for Solving 

Nonsymmetric Linear Systems, Technical Report YALEU/DCS/RR-254, Dept, of 
Computer Science, Yale University, August 1983. 

[11] , Data Communication in Parallel Architectures, Technical Report YALEU/DCS/ 

RR-461, Dept, of Computer Science, Yale University, March 1986. 

[12] M. D. Smooke, Solution of Burner-Stabilized Pre-Mixed Laminar Flames by Boundary Value 

Methods, J. Comp. Phys., 48(1982), pp. 72-105. 

[13] , An Error Estimate for the Modified Newton Method with Applications to the Solution 

of Nonlinear Two-Point Boundary Value Problems, J. Opt. Theory and Appl., 
39(1983), pp. 489-511. 

[14] M. D. Smooke, R. E. Mitchell Si J. F. Grcar, Numerical Solution of a Confined Laminar 

Diffusion Flame, G. Birkhoff Sc A. Schoenstadt ed., Elliptic Problem Solvers II, 
Academic Press, New York, 1984, pp. 557-568. 

[15] M. D. Smooke, J. A. Miller Sc R. J. Kee, Solution of Premixed and Counterflow Diffusion 

Flame Problems by Adaptive Boundary Value Methods, Num. Meth. for Two-Point 
Boundary Value Problems, 5(1985), pp. 303-317. 




Figure 1: A schematic of the Jacobian matrix of the 9-point finite-difference operator showing its 
sparsity structure, (a) upper left: a two-dimensional 5x6 grid with the stencil corresponding to 
{t,j) = (3,3) highlighted, (b) upper right: gross sparsity structure of the Jacobian, showing the 
coupling between rows, (c) lower left: an enlargement of the cross-hatched block of (b), showing the 
coupling between points within a row. (d) lower right: an enlargement of the double-cross-hatched 
block of (c), showing the coupling between degrees of freedom within a point. 
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Figure 2: Surface plots of the parallel efficiency of the relaxation-based nonlinear elliptic solver 
on the methane-air axisymmetric jet diffusion flame problem for stripwise decompositions. The 
resolution of the domain in each dimension, n, runs between 2 and 64. The number of processors 
employed, p, runs between 1 and n. (a) upper left: bus-connected shared memory architecture with 
6 = 1 (o = 0). (b) upper right: ring architecture with a = 100 and 6=1. (c) lower left: hypercube 
architecture with o = 100 and 6=1. (d) lower right: hypercube architecture with a = 1000 and 
6=1. 


25 



Figure 3: Surface plots of the parallel efficiency of the relaxation-based nonlinear elliptic solver 
on the methane-air axisymmetric jet diffusion flame problem for boxwise decompositions. The 
resolution of the domain in each dimension, n, runs between 2 and 64. The number of processors 
employed, p, runs between 1 and n 2 . (a) upper left: bus-connected shared memory architecture 
with 6 = 1 (a = 0). (b) upper right: mesh architecture with a = 100 and 6=1. (c) lower left: 
hypercube architecture with a = 100 and 6=1. (d) lower right: hypercube architecture with 
a = 1000 and 6=1. 
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Figure 4: Theoretical (curves) and actual (plotted circles) parallel efficiency of the relaxation-based 
nonlinear elliptic solver on the model problem for stripwise decompositions on the Intel iPSC-5. (a) 
left: surface plot of theoretical efficiency over 2 < n < 64, 1 < p < n. (b) right: graph comparing 
theoretical and actual efficiency vs. p at n = 16 and n = 32. 
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